04 - 09 - 2023
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex1.png')
plt.imshow(img)
plt.show()
A chart is essentially made of visual encoding and chart apparatus. Chart apparatus refers to the features tied to the type of chart that has been chosen, while the visual encoding elements are customized. In this case, the chart is a bubble plot showing the change in spending relative to a specific year (2019) and a specific week (the one ending April 1).
First, we need to make a distinction among the visual encoding elements, between marks and annotations (or attributes):
In this case, the marks are dots representing:
As mentioned before, annotations can be quantitative or categorical attributes. A quantitative attribute present in this graph is the position of the dots relative to the center of the chart, which conveys the percentage variation in spending with respect to the reference measure. There is also a categorical attribute, the color of the dots, which indicates whether a given expense is higher or lower than the compared expense.
Finally, we can notice that there is no legend in this graph to help us, but there are some text annotations that identify certain categories of expense.
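As a minimal sketch of this kind of encoding with made-up numbers (the values below are invented, not the article's data): distance from the center encodes the magnitude of the percentage change, and color encodes whether spending went up or down:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
change = rng.uniform(-50, 80, size=12)       # % change vs. the reference week (synthetic)
angle = rng.uniform(0, 2 * np.pi, size=12)   # angular position carries no meaning here

# quantitative channel: distance from the center = |% change|
x = np.abs(change) * np.cos(angle)
y = np.abs(change) * np.sin(angle)
# categorical channel: color = direction of the change
colors = np.where(change >= 0, 'tab:orange', 'tab:blue')

plt.figure(figsize=(6, 6))
plt.scatter(x, y, s=200, c=colors, alpha=0.7)
plt.scatter(0, 0, s=50, c='black')  # reference point (no change)
plt.axis('off')
plt.show()
```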
There are three types of visual experience:
Explanatory visual experiences aim to offer a detailed and comprehensive understanding of a topic through visuals. They take responsibility for bringing key insights into the foreground rather than leaving that responsibility to the viewer. The main goal is a self-explanatory chart: the reader should not need an in-person explanation. This type of visual experience typically involves illustrations, diagrams, or animations to guide the viewer. Annotations play an important role here, since colors, captions, and annotations assist the viewer in the interpretation.
Exhibitory visual experiences focus on presenting information/data in a clear and visually appealing way. In this case, no exploration, interaction, or explanation is involved. Typically charts, graphs, diagrams, and infographics are used. Viewers have to interpret the meaning on their own, and they need to know the context and the content, since this visual experience is often offered to a very specific audience or supports written articles/reports.
Exploratory visual experiences allow users to interact with and explore the data. Here viewers still need to find their own insights, but they are assisted by an interactive visualization. Typically, this type of visual experience lets the user highlight or filter by categories of interest, change data parameters, switch views, get annotations by hovering over components, and so on: a nice way to explore data in depth.
Here is an example of each visual experience explained above:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_1.png')
plt.imshow(img)
plt.show()
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_2.png')
plt.imshow(img)
plt.show()
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_3.png')
plt.imshow(img)
plt.show()
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex2_4.png')
plt.imshow(img)
plt.show()
UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are both dimensionality reduction techniques. To explain the sentence, it is necessary to go step by step. The main goal of both techniques is to reduce the dimensionality of high-dimensional data while preserving its structure and relationships. To do so, UMAP transforms the original high-dimensional space into a lower-dimensional one: it tries to find a lower-dimensional representation of the data that still captures the relationships previously present between data points. Thus, the sentence says "induce a space transformation" because UMAP constructs an explicit low-dimensional representation of the data, whereas t-SNE maps the fitted data points from the high-dimensional space to a lower-dimensional one, preserving the similarities between points, without explicitly constructing a reusable lower-dimensional representation.
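The distinction shows up directly in scikit-learn's API: estimators that induce a reusable space transformation expose a `transform` method for mapping new points, while `TSNE` only offers `fit_transform` on the fitted data (the `umap-learn` implementation of UMAP, not used here, does expose `transform`). A minimal check, using PCA as an example of a transformation-inducing estimator:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns an explicit projection, so unseen points can be mapped later
print(hasattr(PCA(n_components=2), "transform"))   # True
# t-SNE only embeds the points it was fitted on: no transform method
print(hasattr(TSNE(n_components=2), "transform"))  # False
```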
Here are some comparisons between UMAP and t-SNE (in biology).
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
plt.figure(figsize = (15,10))
img = mpimg.imread('img/ex3.png')
plt.imshow(img)
plt.show()
Both perform well, but UMAP seems better able to accentuate and highlight clusters and relationships than t-SNE. In any case, a more in-depth investigation is surely needed to properly show the differences between these two techniques.
# imports
import pandas as pd
# read csv
employment = pd.read_csv('datasets/employment_italy.csv')
# visualize data
employment.head()
| ITTER107 | Territorio | TIPO_DATO_FOL | Tipo dato | SEXISTAT1 | Sesso | ETA1 | Classe di età | TITOLO_STUDIO | Titolo di studio | CITTADINANZA | Cittadinanza | TIME | Seleziona periodo | Value | Flag Codes | Flags | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ITC1 | Piemonte | EMP_R | tasso di occupazione | 1 | maschi | Y15-24 | 15-24 anni | 99 | totale | TOTAL | totale | 2018 | 2018 | 25.101074 | NaN | NaN |
| 1 | ITC1 | Piemonte | EMP_R | tasso di occupazione | 1 | maschi | Y15-24 | 15-24 anni | 99 | totale | TOTAL | totale | 2019 | 2019 | 23.817206 | NaN | NaN |
| 2 | ITC1 | Piemonte | EMP_R | tasso di occupazione | 1 | maschi | Y15-24 | 15-24 anni | 99 | totale | TOTAL | totale | 2020 | 2020 | 24.314688 | NaN | NaN |
| 3 | ITC1 | Piemonte | EMP_R | tasso di occupazione | 1 | maschi | Y15-24 | 15-24 anni | 99 | totale | TOTAL | totale | 2021 | 2021 | 25.062330 | NaN | NaN |
| 4 | ITC1 | Piemonte | EMP_R | tasso di occupazione | 1 | maschi | Y15-24 | 15-24 anni | 99 | totale | TOTAL | totale | 2022 | 2022 | 23.491588 | NaN | NaN |
# normalize region names to match the geojson spelling
territorio_map = {
    'Trentino Alto Adige / Südtirol': 'Trentino-Alto Adige/Südtirol',
    'Provincia Autonoma Bolzano / Bozen': 'Trentino-Alto Adige/Südtirol',
    'Provincia Autonoma Trento': 'Trentino-Alto Adige/Südtirol',
    "Valle d'Aosta / Vallée d'Aoste": "Valle d'Aosta/Vallée d'Aoste",
}
employment['Territorio'] = employment['Territorio'].replace(territorio_map)
# prepare data for bar plot
employment_female = employment[employment['Sesso'] == 'femmine']
employment_female_20_64 = employment_female[employment_female['ETA1'] == 'Y20-64']
employment_graduated_female = employment_female_20_64[employment_female_20_64['Titolo di studio'] == 'laurea e post-laurea']
regions = employment_graduated_female['Territorio'].unique().tolist()
employment_graduated_female
| ITTER107 | Territorio | TIPO_DATO_FOL | Tipo dato | SEXISTAT1 | Sesso | ETA1 | Classe di età | TITOLO_STUDIO | Titolo di studio | CITTADINANZA | Cittadinanza | TIME | Seleziona periodo | Value | Flag Codes | Flags | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6452 | ITG1 | Sicilia | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2018 | 2018 | 61.598687 | NaN | NaN |
| 6453 | ITG1 | Sicilia | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2019 | 2019 | 63.168572 | NaN | NaN |
| 6454 | ITG1 | Sicilia | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2020 | 2020 | 63.525725 | NaN | NaN |
| 6455 | ITG1 | Sicilia | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2021 | 2021 | 65.130276 | NaN | NaN |
| 6456 | ITG1 | Sicilia | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 66.394632 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7052 | ITD3 | Veneto | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2018 | 2018 | 80.715351 | NaN | NaN |
| 7053 | ITD3 | Veneto | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2019 | 2019 | 79.953319 | NaN | NaN |
| 7054 | ITD3 | Veneto | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2020 | 2020 | 75.534940 | NaN | NaN |
| 7055 | ITD3 | Veneto | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2021 | 2021 | 81.357479 | NaN | NaN |
| 7056 | ITD3 | Veneto | EMP_R | tasso di occupazione | 2 | femmine | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 83.508957 | NaN | NaN |
110 rows × 17 columns
# prepare data for choropleth map
employment_2022 = employment[employment['TIME'] == '2022']
employment_graduated_2022 = employment_2022[employment_2022['Titolo di studio'] == 'laurea e post-laurea']
employment_2022_20_64 = employment_graduated_2022[employment_graduated_2022['ETA1'] == 'Y20-64']
employment_2022_20_64.head()
| ITTER107 | Territorio | TIPO_DATO_FOL | Tipo dato | SEXISTAT1 | Sesso | ETA1 | Classe di età | TITOLO_STUDIO | Titolo di studio | CITTADINANZA | Cittadinanza | TIME | Seleziona periodo | Value | Flag Codes | Flags | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6406 | ITC2 | Valle d'Aosta/Vallée d'Aoste | EMP_R | tasso di occupazione | 9 | totale | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 84.496008 | NaN | NaN |
| 6411 | ITD4 | Friuli-Venezia Giulia | EMP_R | tasso di occupazione | 9 | totale | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 84.298419 | NaN | NaN |
| 6416 | ITDA | Trentino-Alto Adige/Südtirol | EMP_R | tasso di occupazione | 9 | totale | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 85.271396 | NaN | NaN |
| 6431 | ITF4 | Puglia | EMP_R | tasso di occupazione | 9 | totale | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 73.421410 | NaN | NaN |
| 6436 | ITF3 | Campania | EMP_R | tasso di occupazione | 9 | totale | Y20-64 | 20-64 anni | 11 | laurea e post-laurea | TOTAL | totale | 2022 | 2022 | 71.129689 | NaN | NaN |
# show regions
regions
['Sicilia', 'Calabria', 'Toscana', 'Friuli-Venezia Giulia', "Valle d'Aosta/Vallée d'Aoste", 'Trentino-Alto Adige/Südtirol', 'Puglia', 'Campania', 'Abruzzo', 'Lazio', 'Umbria', 'Emilia-Romagna', 'Liguria', 'Lombardia', 'Marche', 'Basilicata', 'Molise', 'Piemonte', 'Sardegna', 'Veneto']
# imports
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
# plot: employment rate of graduated females in Italy
fig, ax1 = plt.subplots(1, 1, figsize=(25, 10))
plt.suptitle('Employment rate of graduated female (y: 20-64) in Italy', fontsize=25)
sns.barplot(
ax = ax1,
data = employment_graduated_female,
x = 'Territorio',
y = 'Value',
hue = 'TIME'
)
ax1.set_xlabel("Region", fontsize=15)
ax1.set_ylabel("Employment rate", fontsize=15)
ax1.set_xticks(np.arange(0, len(regions), 1))
ax1.set_xticklabels(regions, rotation=90, fontsize=10)
ax1.grid(alpha=.4)
fig.tight_layout()
plt.show()
import json
with open('geojson/limits_IT_regions.geojson') as f:
italy = json.load(f)
for feature in italy['features']:
print(feature['properties']['reg_name'])
Piemonte Valle d'Aosta/Vallée d'Aoste Lombardia Trentino-Alto Adige/Südtirol Veneto Friuli-Venezia Giulia Liguria Emilia-Romagna Toscana Umbria Marche Lazio Abruzzo Molise Campania Puglia Basilicata Calabria Sicilia Sardegna
# get min and max for 'Value'
min_value = employment_2022_20_64['Value'].min()
max_value = employment_2022_20_64['Value'].max()
# keep only the 2022 values so each region maps onto a single row
employment_graduated_female_2022 = employment_graduated_female[employment_graduated_female['TIME'] == '2022']
fig2 = px.choropleth_mapbox(
employment_graduated_female_2022,
geojson=italy,
locations='Territorio',
featureidkey='properties.reg_name',
color='Value',
color_continuous_scale="Viridis",
range_color=(min_value, max_value),
labels={'Value':'Employment rate', 'Territorio':'Region'},
title="Employment rate of graduated female (20-64) in Italy (2022)",
hover_data=['Territorio', 'Value'],
center={"lat": 41.8719, "lon": 12.5674},
mapbox_style="carto-positron",
zoom=4
)
# note: update_geos has no effect on mapbox-based figures;
# the center and zoom arguments above already control the view
fig2.update_layout(margin={"r":0, "t":40, "l":0, "b":0})
fig2.show()
# imports
import pandas as pd
# read csv
ionosphere = pd.read_csv('datasets/ionosphere.csv', header=None)
ionosphere
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.99539 | -0.05889 | 0.85243 | 0.02306 | 0.83398 | -0.37708 | 1.00000 | 0.03760 | 0.85243 | -0.17755 | ... | -0.51171 | 0.41078 | -0.46168 | 0.21266 | -0.34090 | 0.42267 | -0.54487 | 0.18641 | -0.45300 | g |
| 1 | 1.00000 | -0.18829 | 0.93035 | -0.36156 | -0.10868 | -0.93597 | 1.00000 | -0.04549 | 0.50874 | -0.67743 | ... | -0.26569 | -0.20468 | -0.18401 | -0.19040 | -0.11593 | -0.16626 | -0.06288 | -0.13738 | -0.02447 | b |
| 2 | 1.00000 | -0.03365 | 1.00000 | 0.00485 | 1.00000 | -0.12062 | 0.88965 | 0.01198 | 0.73082 | 0.05346 | ... | -0.40220 | 0.58984 | -0.22145 | 0.43100 | -0.17365 | 0.60436 | -0.24180 | 0.56045 | -0.38238 | g |
| 3 | 1.00000 | -0.45161 | 1.00000 | 1.00000 | 0.71216 | -1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | ... | 0.90695 | 0.51613 | 1.00000 | 1.00000 | -0.20099 | 0.25682 | 1.00000 | -0.32382 | 1.00000 | b |
| 4 | 1.00000 | -0.02401 | 0.94140 | 0.06531 | 0.92106 | -0.23255 | 0.77152 | -0.16399 | 0.52798 | -0.20275 | ... | -0.65158 | 0.13290 | -0.53206 | 0.02431 | -0.62197 | -0.05707 | -0.59573 | -0.04608 | -0.65697 | g |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 346 | 0.83508 | 0.08298 | 0.73739 | -0.14706 | 0.84349 | -0.05567 | 0.90441 | -0.04622 | 0.89391 | 0.13130 | ... | -0.04202 | 0.83479 | 0.00123 | 1.00000 | 0.12815 | 0.86660 | -0.10714 | 0.90546 | -0.04307 | g |
| 347 | 0.95113 | 0.00419 | 0.95183 | -0.02723 | 0.93438 | -0.01920 | 0.94590 | 0.01606 | 0.96510 | 0.03281 | ... | 0.01361 | 0.93522 | 0.04925 | 0.93159 | 0.08168 | 0.94066 | -0.00035 | 0.91483 | 0.04712 | g |
| 348 | 0.94701 | -0.00034 | 0.93207 | -0.03227 | 0.95177 | -0.03431 | 0.95584 | 0.02446 | 0.94124 | 0.01766 | ... | 0.03193 | 0.92489 | 0.02542 | 0.92120 | 0.02242 | 0.92459 | 0.00442 | 0.92697 | -0.00577 | g |
| 349 | 0.90608 | -0.01657 | 0.98122 | -0.01989 | 0.95691 | -0.03646 | 0.85746 | 0.00110 | 0.89724 | -0.03315 | ... | -0.02099 | 0.89147 | -0.07760 | 0.82983 | -0.17238 | 0.96022 | -0.03757 | 0.87403 | -0.16243 | g |
| 350 | 0.84710 | 0.13533 | 0.73638 | -0.06151 | 0.87873 | 0.08260 | 0.88928 | -0.09139 | 0.78735 | 0.06678 | ... | -0.15114 | 0.81147 | -0.04822 | 0.78207 | -0.00703 | 0.75747 | -0.06678 | 0.85764 | -0.06151 | g |
351 rows × 33 columns
# prepare fit data
fit_data = ionosphere.iloc[:, :-1]
fit_data.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.99539 | -0.05889 | 0.85243 | 0.02306 | 0.83398 | -0.37708 | 1.00000 | 0.03760 | 0.85243 | -0.17755 | ... | 0.56811 | -0.51171 | 0.41078 | -0.46168 | 0.21266 | -0.34090 | 0.42267 | -0.54487 | 0.18641 | -0.45300 |
| 1 | 1.00000 | -0.18829 | 0.93035 | -0.36156 | -0.10868 | -0.93597 | 1.00000 | -0.04549 | 0.50874 | -0.67743 | ... | -0.20332 | -0.26569 | -0.20468 | -0.18401 | -0.19040 | -0.11593 | -0.16626 | -0.06288 | -0.13738 | -0.02447 |
| 2 | 1.00000 | -0.03365 | 1.00000 | 0.00485 | 1.00000 | -0.12062 | 0.88965 | 0.01198 | 0.73082 | 0.05346 | ... | 0.57528 | -0.40220 | 0.58984 | -0.22145 | 0.43100 | -0.17365 | 0.60436 | -0.24180 | 0.56045 | -0.38238 |
| 3 | 1.00000 | -0.45161 | 1.00000 | 1.00000 | 0.71216 | -1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | ... | 1.00000 | 0.90695 | 0.51613 | 1.00000 | 1.00000 | -0.20099 | 0.25682 | 1.00000 | -0.32382 | 1.00000 |
| 4 | 1.00000 | -0.02401 | 0.94140 | 0.06531 | 0.92106 | -0.23255 | 0.77152 | -0.16399 | 0.52798 | -0.20275 | ... | 0.03286 | -0.65158 | 0.13290 | -0.53206 | 0.02431 | -0.62197 | -0.05707 | -0.59573 | -0.04608 | -0.65697 |
5 rows × 32 columns
from sklearn.decomposition import PCA
# apply pca to ionosphere with 2 components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(fit_data)
ionosphere['pca-one'] = pca_result[:, 0]
ionosphere['pca-two'] = pca_result[:, 1]
# start to try different t-SNE: perplexity = 5
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=5, n_iter=300)
tsne_results = tsne.fit_transform(fit_data)
ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]
# view t-SNE with perplexity = 5
plt.figure(figsize=(16, 10))
plt.suptitle('t-SNE with perplexity = 5', fontsize=25)
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
)
plt.show()
[t-SNE] Computing 16 nearest neighbors... [t-SNE] Indexed 351 samples in 0.001s... [t-SNE] Computed neighbors for 351 samples in 0.050s... [t-SNE] Computed conditional probabilities for sample 351 / 351 [t-SNE] Mean sigma: 0.281231 [t-SNE] KL divergence after 250 iterations with early exaggeration: 66.277557 [t-SNE] KL divergence after 300 iterations: 1.039823
# t-SNE with perplexity 25
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=25, n_iter=300)
tsne_results = tsne.fit_transform(fit_data)
ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]
# view t-SNE with perplexity = 25
plt.figure(figsize=(16, 10))
plt.suptitle('t-SNE with perplexity = 25', fontsize=25)
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
)
plt.show()
[t-SNE] Computing 76 nearest neighbors... [t-SNE] Indexed 351 samples in 0.000s... [t-SNE] Computed neighbors for 351 samples in 0.007s... [t-SNE] Computed conditional probabilities for sample 351 / 351 [t-SNE] Mean sigma: 0.588407 [t-SNE] KL divergence after 250 iterations with early exaggeration: 54.903748 [t-SNE] KL divergence after 300 iterations: 0.619860
# t-SNE with perplexity 40
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(fit_data)
ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]
# view t-SNE with perplexity = 40
plt.figure(figsize=(16, 10))
plt.suptitle('t-SNE with perplexity = 40', fontsize=25)
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
)
plt.show()
[t-SNE] Computing 121 nearest neighbors... [t-SNE] Indexed 351 samples in 0.001s... [t-SNE] Computed neighbors for 351 samples in 0.009s... [t-SNE] Computed conditional probabilities for sample 351 / 351 [t-SNE] Mean sigma: 0.747613 [t-SNE] KL divergence after 250 iterations with early exaggeration: 50.705322 [t-SNE] KL divergence after 300 iterations: 0.505443
# t-SNE with perplexity = 25 and 1000 iterations
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=25, n_iter=1000)
tsne_results = tsne.fit_transform(fit_data)
ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]
# view t-SNE with perplexity = 25
plt.figure(figsize=(16, 10))
plt.suptitle('t-SNE with perplexity = 25 and 1000 iterations', fontsize=25)
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
)
plt.show()
[t-SNE] Computing 76 nearest neighbors... [t-SNE] Indexed 351 samples in 0.000s... [t-SNE] Computed neighbors for 351 samples in 0.006s... [t-SNE] Computed conditional probabilities for sample 351 / 351 [t-SNE] Mean sigma: 0.588407 [t-SNE] KL divergence after 250 iterations with early exaggeration: 54.303207 [t-SNE] KL divergence after 1000 iterations: 0.540016
# t-SNE with perplexity = 50 and 1000 iterations
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=1000)
tsne_results = tsne.fit_transform(fit_data)
ionosphere['tsne-2d-one'] = tsne_results[:, 0]
ionosphere['tsne-2d-two'] = tsne_results[:, 1]
# view t-SNE with perplexity = 50
plt.figure(figsize=(16, 10))
plt.suptitle('t-SNE with perplexity = 50 and 1000 iterations', fontsize=25)
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
)
plt.show()
[t-SNE] Computing 151 nearest neighbors... [t-SNE] Indexed 351 samples in 0.001s... [t-SNE] Computed neighbors for 351 samples in 0.009s... [t-SNE] Computed conditional probabilities for sample 351 / 351 [t-SNE] Mean sigma: 0.850265 [t-SNE] KL divergence after 250 iterations with early exaggeration: 49.105682 [t-SNE] KL divergence after 1000 iterations: 0.380867
# plot PCA and t-SNE side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(25, 15))
fig.suptitle('PCA vs. t-SNE', fontsize=25)
sns.scatterplot(
x="pca-one", y="pca-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
ax=ax1
)
sns.scatterplot(
x="tsne-2d-one", y="tsne-2d-two",
hue=32,
palette=sns.color_palette("hls", 2),
data=ionosphere,
legend="full",
alpha=0.7,
ax=ax2
)
fig.tight_layout()
plt.show()
From this graph, we can see that t-SNE performs better than PCA on our dataset. More systematic testing would certainly be needed to state definitively which works best for this data, and it is not easy to find optimal parameters for t-SNE. A good compromise seems to be a perplexity of 40 with a lower number of iterations, compared with the last runs, which used a high number of iterations.
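As a hedged sketch of how one might search more systematically (using scikit-learn's built-in iris data as a stand-in rather than re-loading `ionosphere.csv`), the embedding produced at each perplexity can be scored against the known labels with the silhouette coefficient:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# stand-in dataset; with the notebook's data, X would be fit_data
# and y the 'g'/'b' label column
X, y = load_iris(return_X_y=True)

for perplexity in (5, 25, 40):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(X)
    # higher silhouette = better-separated classes in the embedding
    print(perplexity, round(silhouette_score(embedding, y), 3))
```

Since KL divergence values are not comparable across perplexities (the target distribution changes), an external score of this kind gives a fairer basis for choosing parameters.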
# imports
import pandas as pd
import plotly.express as px
# read csv
data = pd.read_csv('datasets/mydata.csv')
data['Variable'] = data['Variable'].astype(float)
# import geojson
import json
with open('geojson/usa.geo.json') as f:
usa = json.load(f)
for feature in usa['features']:
print(feature['properties']['NAME'])
len(usa['features'])
Maine Massachusetts Michigan Montana Nevada New Jersey New York North Carolina Ohio Pennsylvania Rhode Island Tennessee Texas Utah Washington Wisconsin Puerto Rico Maryland Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Minnesota Mississippi Missouri Nebraska New Hampshire New Mexico North Dakota Oklahoma Oregon South Carolina South Dakota Vermont Virginia West Virginia Wyoming
52
# get states from dataset
states = data['Name'].unique().tolist()
len(states)
50
# get min and max for variable
min_var = data['Variable'].min()
max_var = data['Variable'].max()
fig = px.choropleth(data,
geojson=usa,
featureidkey='properties.NAME',
locations='Name',
color='Variable',
color_continuous_scale="BuPu",
range_color=(min_var, max_var),
scope="usa",
labels={'Variable':'Variable'},
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.update_geos(
lataxis_showgrid=True,
lonaxis_showgrid=True,
bgcolor="lightgrey",
landcolor="black")
fig.show()